---
output:
  html_document: default
---
# Replication Guide for "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning" 

Ranjit Lall and Thomas Robinson

Last edit: October 2020

This document outlines replication and simulation reproduction details for the paper "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning."

To replicate all figures and tables in the paper, run the `main_replication.R` file in the main directory.

To enable an extensive reproduction of the full simulation tests, we have also included the underlying simulation scripts. **Some of these simulations take a long time to run (including a 3-week runtime for the column-wise speed test)** (see the runtimes section below) and are thus unsuitable for running on personal computers.

* MAR-1, Kropko, and Application tests can be run on a mid-tier MacBook Pro
* Adult and Speed applications require a server instance to be executed

The remainder of this document is organised as follows: Section 2 describes the file structure for the replication materials (by file type), Section 3 lists hardware and software requirements, including how to install an archived version of MIDAS used for the simulations, Section 4 gives approximate runtimes for the simulations, Section 5 details generic information about how to call the respective simulation tests, and Section 6 provides guidance on how to initialise, set up and execute the simulation tests on a server instance.

## Files

This section details the replication and simulation scripts, data and results files included in the replication materials.

*Working directory should be set to the replication package's main directory.*

### Replication files

Complete replication of all results presented in the main text and Online Appendix.

1. main_replication.R

### Simulation files

Underlying code to run simulation tests.

*R*:

1. mar1/mar1_data_gen.R
2. mar1/mar1_midas.R
3. kropko/kropko_sim.R
4. kropko/kropko_anes.R
5. adult/adult.R
6. speed/speed_cols.R
7. speed/speed_rows.R

*Python*:

1. mar1/mar1_midas.py
2. kropko/kropko_midas_mesh_cont.py
3. kropko/kropko_anes_midas_imps.py [Appendix]
4. adult/adult.py
5. speed/cces_cols.py
6. speed/cces_rows.py
7. application/cces_application.py
8. wdi/wdi_demo.py [Appendix]


#### Helper files

Additional files sourced within simulations:

1. kropko/helper2.R
2. kropko/midas_handler_cont.R
3. kropko/kropko_midas_mesh_cont.py
4. kropko/kropko_anes_test_helper.R

### Raw data files

Data files adapted from various sources (see citations)

1. data/adult_data.csv -- copy of the Adult data, UCI Machine Learning Repository[^adult_fn]
2. data/base2008_3.dta -- copy of the 2008 ANES data, as per Kropko et al. (2014)[^kropko_fn]
3. data/cces_subset.csv -- formatted CCES data for use in simulations underlying Figures 6 and 7[^cces_fn]
4. data/cces_format.csv -- formatted CCES data for use in Figure 8 and Table 1[^cces_fn]
5. data/raw_numbers.csv -- data for WDI simulation in Figure A4

[^adult_fn]:
  Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  
[^kropko_fn]:
  Kropko, Jonathan; Goodrich, Ben; Gelman, Andrew; Hill, Jennifer, 2014, "Replication data for: Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches", https://doi.org/10.7910/DVN/24672, Harvard Dataverse, V3, UNF:5:QuxE8nFhbW2JZT+OW9WzWw== [fileUNF] 
  
[^cces_fn]:
  Brian Schaffner; Stephen Ansolabehere; Sam Luks, 2019, "CCES Common Content, 2018", https://doi.org/10.7910/DVN/ZSBZ7K, Harvard Dataverse, V6, UNF:6:hFVU8vQ/SLTMUXPgmUw3JQ== [fileUNF] 


### Results files

Where file sizes are relatively small, we have included the "raw" missing and imputed datasets. For some tests (Kropko, Adult, Speed), the number or size of the imputation files is prohibitive, so these files are not saved. The output results files are, however, included for analysis/replication.

*MAR-1*:

1. mar1/mar1_midas_results.csv
2. mar1/mar1_r_results.csv
3. mar1/data_tmp -- copies of missing and imputed datasets

*Kropko*:

4. kropko/highcorr_results.csv
5. kropko/lowcorr_results.csv
6. kropko/anes_midas [Appendix] -- copies of missing and imputed datasets

*Adult*:

7. adult/adult_test_results.csv

*Speed*:

8. speed/speed_col_results.csv
9. speed/speed_row_results.csv

*Application*:

10. application/data_tmp -- copies of MIDAS imputed datasets

*WDI [Appendix]*:

11. [CIV/CMR/COG/GHA/NER/ZMB]_output.csv -- country-specific imputations results

### Other files

1. README.md -- details for journal replication
2. package_dependencies.R -- R script to load packages on server instance (for simulations)
3. makefile -- replication commands to run at command line
4. MIDAS-master.zip -- source files for original MIDAS python module used in simulations

## Prerequisites

### Hardware

All paper replication files (figures and tables) were generated using:

MacBook Pro (13-inch, 2017):

* 2.3 GHz Dual-Core Intel Core i5
* 8GB RAM
* macOS Catalina (10.15.5)

Simulations were generated using:

Amazon Web Services (AWS):

* m5.xlarge instance
* Ubuntu 18.04

### Software

#### MIDAS Python class

To install the version of the MIDAS Python class (MIDASpy) used for simulation in the paper, enter the following command at the command line:
`pip install git+https://github.com/Oracen/MIDAS.git`

Python 3.7

Packages:

* Python (3.7.3)
* Numpy (1.18.1)
* Pandas (0.24.2)
* Tensorflow (1.14.0)
* Matplotlib (3.1.0)
* Sklearn (0.21.2)
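
Before running the simulations, it can help to confirm that the installed Python packages match the versions listed above. The following is a minimal sketch and not part of the replication materials: the names `EXPECTED_VERSIONS`, `installed_version` and `compare_versions` are hypothetical, and it assumes each package exposes a `__version__` attribute under the module names shown (scikit-learn is imported as `sklearn`).

```python
import importlib

# Versions copied from the package list above; keys are import names.
EXPECTED_VERSIONS = {
    "numpy": "1.18.1",
    "pandas": "0.24.2",
    "tensorflow": "1.14.0",
    "matplotlib": "3.1.0",
    "sklearn": "0.21.2",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_versions(expected, lookup=installed_version):
    """Map each package to a tuple (expected, found, versions_match)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report
```

Calling `compare_versions(EXPECTED_VERSIONS)` in the Python environment used for the simulations returns one entry per package; any entry whose last element is `False` indicates a missing package or a version that differs from the one listed here.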

#### R 

Version 3.6

Packages:

* tidyverse (1.3.0)
* xtable (1.8-4)
* Amelia (1.7.5)
* mice (3.6.0)
* ggpubr (0.2.1)
* reshape2 (1.4.3)

Additional packages for simulation:

* mi (1.0)
* mvnmle (0.1-11.1)
* MASS (7.3-51.5)
* norm (1.0-9.5)
* nnet (7.3-12)
* arm (1.10-1)
* dplyr (0.8.4)
* readr (1.3.1)
* purrr (0.3.3)
* betareg (3.1-2)
* norm2 (2.0.2)
* doParallel (1.0.15)
* ggplot2 (3.2.1)
* haven (2.3.1)
* foreign (0.8-75)

## Simulation Overview: Runtimes

We conducted the MAR-1, Kropko, and CCES application simulation tests on a MacBook Pro (dual-core Intel Core i5-7360U @ 2.3GHz, 8.00GB RAM, macOS 10.15.5). The approximate runtimes are:

* MAR-1 -- 5 minutes
* Kropko -- 11 hours
* Application -- 4 hours

The adult imputation accuracy test and both speed tests (column-wise and row-wise) should be executed on a server instance. Runtimes using an m5.xlarge instance were approximately:

* Adult -- 115 hours continuous runtime on server
* Speed (column) -- 265 hours continuous runtime on server
* Speed (row) -- 76 hours continuous runtime on server

NOTE: server runtimes may be slightly longer since these estimates exclude failed runs of Amelia.
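
When budgeting server time, it may be easier to think of the server runtimes in days rather than hours. The short sketch below only converts the approximate figures listed above; the names `SERVER_RUNTIME_HOURS` and `hours_to_days` are hypothetical, and the figures exclude failed Amelia runs, as noted.

```python
# Approximate m5.xlarge runtimes in hours, copied from the list above.
SERVER_RUNTIME_HOURS = {
    "Adult": 115,
    "Speed (column)": 265,
    "Speed (row)": 76,
}


def hours_to_days(hours):
    """Convert a runtime in hours to days of continuous execution."""
    return hours / 24


total_hours = sum(SERVER_RUNTIME_HOURS.values())

for test_name, hours in SERVER_RUNTIME_HOURS.items():
    print(f"{test_name}: {hours} h = {hours_to_days(hours):.1f} days")

print(f"Total: {total_hours} h = {hours_to_days(total_hours):.1f} days")
```

Run back to back, the three server tests come to 456 hours, or 19 days of continuous runtime.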

## Simulation Overview: Running simulations from the main paper

Our simulation studies are computationally intensive and rely on simultaneous R/Python calls. We therefore provide a makefile within the main directory (to run the files in the correct order).

This section details the contents of the makefile and gives generic instructions for running each simulation test. The final section of this guide provides AWS-specific instructions for running these replications.

* To run all analyses: `make` [NOTE: Due to computational intensity, this is not recommended]
* Run MAR-1 simulation: `make mar1`
* Run Kropko MVN simulation: `make kropko`
* Run adult simulation: `make adult`
* Run speed comparison simulations: `make speedcol` and `make speedrow`
* Run CCES ideology example: `make application`
* Generate all figures and tables: `make replication`

### MAR-1 

`make mar1` 

Or run sequentially at the command line: `Rscript --no-save mar1/mar1_data_gen.R && python3 mar1/mar1_midas.py && Rscript --no-save mar1/mar1_midas.R`

### Kropko

`make kropko`

Or run *simultaneously* at the command line: `Rscript --no-save kropko/kropko_sim.R & python3 kropko/kropko_midas_mesh_cont.py`

The simulation code is adapted from the Kropko et al. replication code, available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/24672.

Since MIDAS runs in Python, we mesh the existing R script with a MIDAS Python script by running the two in parallel. Each time an MVN dataset is generated, the R script creates a marker file and waits. The MIDAS script detects the marker file, runs the imputation code, then deletes the marker file. The R script then resumes. The Python script terminates once the full results file has been generated.
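
The Python side of this marker-file handshake can be sketched roughly as follows. This is an illustration of the general pattern only, not the code in the replication scripts: the function name `serve_marker_requests`, its parameters, and the polling interval are all hypothetical, and `impute` stands in for whatever MIDAS imputation routine is run on each dataset.

```python
import os
import time


def serve_marker_requests(marker_path, results_path, impute, poll_seconds=1.0):
    """Run `impute` each time the marker file appears; stop once results exist.

    marker_path  -- file the R script creates when a new dataset is ready
    results_path -- file whose existence signals that the simulation is done
    impute       -- zero-argument callable performing the imputation step
    Returns the number of marker files handled.
    """
    handled = 0
    while not os.path.exists(results_path):
        if os.path.exists(marker_path):
            impute()                 # impute the dataset the R script just wrote
            os.remove(marker_path)   # deleting the marker lets the R script resume
            handled += 1
        else:
            time.sleep(poll_seconds)  # nothing to do yet; wait and check again
    return handled
```

Deleting the marker only after `impute` returns means the waiting script cannot resume before the imputed data have been written.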

### Adult tests

`make adult`

Or run *simultaneously* at the command line: `python3 adult/adult.py & Rscript --no-save adult/adult.R`

Like the Kropko test, the adult data imputation tests rely on simultaneous execution of a MIDAS Python script and R. 
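
As an alternative to the shell's `&` operator, the two scripts can be launched together from a small Python driver. The sketch below is one possible way to do this and is not part of the replication materials: `run_together` and `ADULT_COMMANDS` are hypothetical names, and the script paths assume the working directory is the replication package's main directory.

```python
import subprocess


def run_together(commands):
    """Start every command at once, wait for all of them, return exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


# The two halves of the adult test, as given in the command above.
ADULT_COMMANDS = [
    ["python3", "adult/adult.py"],
    ["Rscript", "--no-save", "adult/adult.R"],
]
```

Calling `run_together(ADULT_COMMANDS)` blocks until both scripts have finished, whereas the shell form returns control as soon as the foreground command exits. A non-zero entry in the returned list indicates that the corresponding script failed.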

### Speed tests

**Column-wise test**

`make speedcol`

Or run *simultaneously* at the command line: `python3 speed/cces_cols.py & Rscript --no-save speed/speed_cols.R`

**Row-wise test**

`make speedrow`

Or run *simultaneously* at the command line: `python3 speed/cces_rows.py & Rscript --no-save speed/speed_rows.R`

### Application (CCES test)

`make application`

Or run at the command line: `python3 application/cces_application.py`


## Setting up an AWS Instance

This section details how to reproduce the simulation tests using a server instance on AWS. Given the runtimes (and costs) involved, we recommend running each test separately.

The following instructions assume an AWS m5.xlarge instance has been activated, that the user has created and linked an encryption key, and is using a Unix operating system.

Tunnel into the instance from the command line (your authentication key should be stored in the current directory):

`ssh -i [your_authentication_key_name.pem] ubuntu@[server.address]`

From the server instance, install R by running the following commands:

```
sudo apt install apt-transport-https software-properties-common
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
sudo apt update
sudo apt install r-base
sudo apt install build-essential
```

Install the pip interface for Python 3, then install the required Python packages:
```
sudo apt install python3-pip

pip3 install tensorflow==1.14.0
pip3 install numpy
pip3 install pandas
pip3 install matplotlib
pip3 install scikit-learn
pip3 install git+https://github.com/Oracen/MIDAS.git

```

Manually install at least one package in R to establish a personal library. You will need to respond 'yes' when R asks whether to set up a new personal library. You do *not* need to save your workspace image upon quitting. You can do this with the following commands:
```
R
install.packages("dplyr")
quit()
```

Copy the replication files to the instance so that the file structure mirrors that of the replication package, either manually or using software like Cyberduck.

Now run the R package installation script:
`Rscript --no-save package_dependencies.R`

## To run the simulation files

In order to disconnect from the server without terminating the simulation tests, we need to use the `screen` utility to create a persistent terminal session on the server instance. To start a named `screen` session so that the code keeps running after you disconnect from the instance, type:
`screen -S [SCREEN NAME]`

Then call the specific make command, e.g.:
`make speedcol`

To detach while keeping the terminal session running, press Ctrl+a, then d.

To reattach to the session:
`screen -r [SCREEN NAME]`

To kill the screen window (terminating all code within it), press Ctrl+a, then k.

To increase scrollback to 1000 lines (in order to monitor progress better), press Ctrl+a, then type `:scrollback 1000` (including the colon).


